Machine Learning Analysis Pipeline
EDR: Dataset Loading & Preprocessing
EDR – Train/Test Overview
• Train shape: (9561, 20) | Test shape: (818, 20)
• Total train samples: 9,561 | Total test samples: 818
• Number of features: 16
• Target column: 'label'
• Missing values (train): 0 | (test): 0
EDR – Train Class Distribution
• 0: 8,704
• 1: 857
• Class balance (minority/majority): 9.8460%
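The class-balance figure above can be reproduced from the two class counts. The snippet below is a minimal sketch; the helper name `class_balance` is hypothetical and not part of the pipeline.

```python
def class_balance(counts):
    """Return the minority/majority ratio for a {label: count} mapping."""
    minority = min(counts.values())
    majority = max(counts.values())
    return minority / majority

# Train class counts as listed above.
train_counts = {0: 8704, 1: 857}
ratio = class_balance(train_counts)
print(f"minority/majority: {ratio:.4%}")  # 9.8460%
```

Note that this is a ratio between the two classes, not the positive share of the whole training set, which would be 857 / 9,561.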
EDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
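The preparation steps listed above can be sketched without dependencies as follows. This is an illustration under assumptions, not the pipeline's code: the report only says infinite values were "handled", so the sketch assumes they are treated as missing and filled with the train median; the function names are hypothetical; and the scaling mimics what `StandardScaler` does (train mean, population standard deviation).

```python
import math
from statistics import fmean, median, pstdev

def _fill(row, medians):
    """Replace inf, -inf and NaN with the training median of that column (assumed handling)."""
    return [v if math.isfinite(v) else m for v, m in zip(row, medians)]

def fit_preprocessor(train_rows):
    """Learn per-column medians, means and standard deviations from training rows only."""
    columns = list(zip(*train_rows))
    medians = [median(v for v in col if math.isfinite(v)) for col in columns]
    filled_cols = list(zip(*(_fill(row, medians) for row in train_rows)))
    means = [fmean(col) for col in filled_cols]
    # Population std (ddof=0), as StandardScaler uses; constant columns fall back to 1.0.
    stds = [pstdev(col) or 1.0 for col in filled_cols]
    return medians, means, stds

def transform(rows, params):
    """Apply the train-derived statistics to any split (train or test)."""
    medians, means, stds = params
    return [
        [(v - mu) / sd for v, mu, sd in zip(_fill(row, medians), means, stds)]
        for row in rows
    ]

# Hypothetical toy data, for illustration only.
train = [[1.0, 10.0], [2.0, float("inf")], [float("nan"), 30.0], [4.0, 20.0]]
test = [[float("-inf"), 15.0]]

params = fit_preprocessor(train)   # statistics come from train only
train_scaled = transform(train, params)
test_scaled = transform(test, params)
```

Fitting on the training split and reusing those statistics on the test split keeps test information out of the preprocessing step.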
Baseline (Most-Frequent) Accuracy: 0.9095
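The baseline is the accuracy obtained by always predicting the majority training class. As a minimal sketch, it can be reproduced from the train class counts and the test support values (744 negatives, 74 positives) reported in the classification reports; the function name is hypothetical.

```python
from collections import Counter

def most_frequent_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return hits / len(test_labels)

# Label lists rebuilt from the counts in this report.
train_labels = [0] * 8704 + [1] * 857
test_labels = [0] * 744 + [1] * 74

baseline = most_frequent_baseline(train_labels, test_labels)
print(f"{baseline:.4f}")  # 0.9095
```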
EDR: Model Performance Comparison
EDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.8631 | 0.6327 | 0.2889 | 0.3514 | 0.3171 | 0.6122 | 0.2094 |
| Random Forest (SMOTE) | 0.8007 | 0.6775 | 0.2335 | 0.5270 | 0.3237 | 0.8286 | 0.2972 |
| LightGBM | 0.7995 | 0.6890 | 0.2384 | 0.5541 | 0.3333 | 0.8431 | 0.3639 |
| Balanced RF | 0.8447 | 0.6834 | 0.2880 | 0.4865 | 0.3618 | 0.8447 | 0.3575 |
| SGD SVM | 0.8753 | 0.5725 | 0.2586 | 0.2027 | 0.2273 | nan | nan |
| IsolationForest | 0.8447 | 0.5496 | 0.1728 | 0.1892 | 0.1806 | nan | nan |
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 680 | 64 | 48 | 26 | 8.60% | 64.86% |
| Random Forest (SMOTE) | 616 | 128 | 35 | 39 | 17.20% | 47.30% |
| LightGBM | 613 | 131 | 33 | 41 | 17.61% | 44.59% |
| Balanced RF | 655 | 89 | 38 | 36 | 11.96% | 51.35% |
| SGD SVM | 701 | 43 | 59 | 15 | 5.78% | 79.73% |
| IsolationForest | 677 | 67 | 60 | 14 | 9.01% | 81.08% |
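The rate columns and the headline metrics all follow from the four confusion counts. The sketch below uses the standard definitions and checks them against the LightGBM row; the function name is hypothetical, and it assumes no denominator is zero.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Derive the report's headline metrics from raw confusion counts."""
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "balanced_acc": (recall + specificity) / 2,
        "precision": tp / (tp + fp),
        "recall": recall,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "fp_rate": fp / (fp + tn),     # false alarms among actual negatives
        "miss_rate": fn / (fn + tp),   # missed positives, equal to 1 - recall
    }

# LightGBM row from the table above.
m = confusion_metrics(tn=613, fp=131, fn=33, tp=41)
print(f"FP rate {m['fp_rate']:.2%}, miss rate {m['miss_rate']:.2%}")
# FP rate 17.61%, miss rate 44.59%
```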
Best Models by Metric
| Metric | Best Model | Value |
|---|---|---|
| Accuracy | SGD SVM | 0.8753 |
| Balanced Acc | LightGBM | 0.6890 |
| Precision | Logistic Regression | 0.2889 |
| Recall | LightGBM | 0.5541 |
| F1 | Balanced RF | 0.3618 |
| ROC-AUC | Balanced RF | 0.8447 |
| PR-AUC | LightGBM | 0.3639 |
| Lowest False Positive Rate | SGD SVM | 5.78% |
| Lowest Miss Rate | LightGBM | 44.59% |
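One way such a per-metric summary could be derived from the results table is sketched below. It assumes `nan` entries are skipped and that rate metrics are minimised rather than maximised; the data structure and function name are hypothetical, and only a subset of the metrics is copied in.

```python
import math

NAN = float("nan")  # marks a metric that was not reported for a model

# Values copied from the EDR tables above (FP Rate in percent).
edr = {
    "Logistic Regression":   {"Accuracy": 0.8631, "F1": 0.3171, "ROC-AUC": 0.6122, "FP Rate": 8.60},
    "Random Forest (SMOTE)": {"Accuracy": 0.8007, "F1": 0.3237, "ROC-AUC": 0.8286, "FP Rate": 17.20},
    "LightGBM":              {"Accuracy": 0.7995, "F1": 0.3333, "ROC-AUC": 0.8431, "FP Rate": 17.61},
    "Balanced RF":           {"Accuracy": 0.8447, "F1": 0.3618, "ROC-AUC": 0.8447, "FP Rate": 11.96},
    "SGD SVM":               {"Accuracy": 0.8753, "F1": 0.2273, "ROC-AUC": NAN,    "FP Rate": 5.78},
    "IsolationForest":       {"Accuracy": 0.8447, "F1": 0.1806, "ROC-AUC": NAN,    "FP Rate": 9.01},
}

def best_by_metric(results, metric, higher_is_better=True):
    """Return (model, value) for the best non-nan score on one metric."""
    scored = {
        name: row[metric]
        for name, row in results.items()
        if not math.isnan(row[metric])
    }
    pick = max if higher_is_better else min
    name = pick(scored, key=scored.get)
    return name, scored[name]

print(best_by_metric(edr, "ROC-AUC"))                          # ('Balanced RF', 0.8447)
print(best_by_metric(edr, "FP Rate", higher_is_better=False))  # ('SGD SVM', 5.78)
```

Filtering out `nan` before comparing matters because comparisons against `nan` are always false, which would make `max` order-dependent.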
EDR – Metrics by Model
EDR – ROC Curves
EDR – Precision–Recall Curves
EDR – Predicted Probability Distributions
EDR – Threshold Sweep
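A threshold sweep recomputes precision and recall while the decision cut-off on the predicted positive-class score is varied. The sketch below illustrates the idea on hypothetical toy labels and scores; it is not the pipeline's plotting code, and none of the numbers come from this report.

```python
def threshold_sweep(y_true, scores, thresholds):
    """Return a list of (threshold, precision, recall), one per cut-off."""
    rows = []
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
        # Guard against empty denominators at extreme thresholds.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append((t, precision, recall))
    return rows

# Hypothetical toy data, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]

sweep = threshold_sweep(y_true, scores, [0.3, 0.5, 0.7])
for t, p, r in sweep:
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold generally trades recall for precision, which is the trade-off a sweep plot makes visible.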
EDR: Logistic Regression – Detailed Analysis
EDR – Logistic Regression: Confusion Matrix
EDR – Logistic Regression: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9341 | 0.9140 | 0.9239 | 744 |
| 1 | 0.2889 | 0.3514 | 0.3171 | 74 |
| accuracy | n/a | n/a | 0.8631 | 818 |
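Each per-class row in a classification report can be recovered from the model's confusion counts by treating that class as the "positive" one. The sketch below checks this for Logistic Regression using its EDR confusion counts (TN 680, FP 64, FN 48, TP 26); the function name is hypothetical.

```python
def per_class_report(tn, fp, fn, tp):
    """Return {class: (precision, recall, f1, support)} for a binary problem."""
    def prf(hit, wrongly_predicted, missed):
        precision = hit / (hit + wrongly_predicted)
        recall = hit / (hit + missed)
        f1 = 2 * hit / (2 * hit + wrongly_predicted + missed)
        return precision, recall, f1

    return {
        # Class 0: hits are TN; FN were wrongly predicted as 0; FP are class-0 rows that were missed.
        0: prf(tn, fn, fp) + (tn + fp,),
        # Class 1: hits are TP; FP were wrongly predicted as 1; FN are class-1 rows that were missed.
        1: prf(tp, fp, fn) + (tp + fn,),
    }

report = per_class_report(tn=680, fp=64, fn=48, tp=26)
for label, (p, r, f1, support) in report.items():
    print(f"{label}: precision={p:.4f} recall={r:.4f} f1={f1:.4f} support={support}")
# 0: precision=0.9341 recall=0.9140 f1=0.9239 support=744
# 1: precision=0.2889 recall=0.3514 f1=0.3171 support=74
```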
EDR – Logistic Regression: Feature Importance
EDR: Random Forest (SMOTE) – Detailed Analysis
EDR – Random Forest (SMOTE): Confusion Matrix
EDR – Random Forest (SMOTE): Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9462 | 0.8280 | 0.8832 | 744 |
| 1 | 0.2335 | 0.5270 | 0.3237 | 74 |
| accuracy | n/a | n/a | 0.8007 | 818 |
EDR – Random Forest (SMOTE): Feature Importance
EDR: LightGBM – Detailed Analysis
EDR – LightGBM: Confusion Matrix
EDR – LightGBM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9489 | 0.8239 | 0.8820 | 744 |
| 1 | 0.2384 | 0.5541 | 0.3333 | 74 |
| accuracy | n/a | n/a | 0.7995 | 818 |
EDR – LightGBM: Feature Importance
EDR: Balanced RF – Detailed Analysis
EDR – Balanced RF: Confusion Matrix
EDR – Balanced RF: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9452 | 0.8804 | 0.9116 | 744 |
| 1 | 0.2880 | 0.4865 | 0.3618 | 74 |
| accuracy | n/a | n/a | 0.8447 | 818 |
EDR – Balanced RF: Feature Importance
EDR: SGD SVM – Detailed Analysis
EDR – SGD SVM: Confusion Matrix
EDR – SGD SVM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9224 | 0.9422 | 0.9322 | 744 |
| 1 | 0.2586 | 0.2027 | 0.2273 | 74 |
| accuracy | n/a | n/a | 0.8753 | 818 |
EDR – SGD SVM: Feature Importance
EDR: IsolationForest – Detailed Analysis
EDR – IsolationForest: Confusion Matrix
EDR – IsolationForest: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9186 | 0.9099 | 0.9142 | 744 |
| 1 | 0.1728 | 0.1892 | 0.1806 | 74 |
| accuracy | n/a | n/a | 0.8447 | 818 |
EDR – IsolationForest: Feature Importance
Feature importance not available for this model type.
XDR: Dataset Loading & Preprocessing
XDR – Train/Test Overview
• Train shape: (9561, 34) | Test shape: (818, 34)
• Total train samples: 9,561 | Total test samples: 818
• Number of features: 30
• Target column: 'label'
• Missing values (train): 0 | (test): 0
XDR – Train Class Distribution
• 0: 8,704
• 1: 857
• Class balance (minority/majority): 9.8460%
XDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
Baseline (Most-Frequent) Accuracy: 0.9095
XDR: Model Performance Comparison
XDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.8484 | 0.5820 | 0.2159 | 0.2568 | 0.2346 | 0.5956 | 0.1997 |
| Random Forest (SMOTE) | 0.8778 | 0.6468 | 0.3375 | 0.3649 | 0.3506 | 0.8295 | 0.3570 |
| LightGBM | 0.8778 | 0.6833 | 0.3587 | 0.4459 | 0.3976 | 0.8621 | 0.3801 |
| Balanced RF | 0.8619 | 0.6807 | 0.3178 | 0.4595 | 0.3757 | 0.8437 | 0.3726 |
| SGD SVM | 0.7323 | 0.5182 | 0.1038 | 0.2568 | 0.1479 | nan | nan |
| IsolationForest | 0.8851 | 0.5474 | 0.2500 | 0.1351 | 0.1754 | nan | nan |
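Because the EDR and XDR runs share the same models and test set, the two metrics tables can be compared model by model. The sketch below computes the change in F1 when moving from the EDR to the XDR feature set, using only F1 values copied from the two tables in this report; the variable names are hypothetical.

```python
# F1 per model, copied from the EDR and XDR metrics tables.
edr_f1 = {
    "Logistic Regression": 0.3171,
    "Random Forest (SMOTE)": 0.3237,
    "LightGBM": 0.3333,
    "Balanced RF": 0.3618,
    "SGD SVM": 0.2273,
    "IsolationForest": 0.1806,
}
xdr_f1 = {
    "Logistic Regression": 0.2346,
    "Random Forest (SMOTE)": 0.3506,
    "LightGBM": 0.3976,
    "Balanced RF": 0.3757,
    "SGD SVM": 0.1479,
    "IsolationForest": 0.1754,
}

# Positive delta: the model scored a higher F1 on the XDR features.
f1_delta = {model: xdr_f1[model] - edr_f1[model] for model in edr_f1}

for model, delta in sorted(f1_delta.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<22} {delta:+.4f}")
```

With a single test split of 818 rows and 74 positives, small differences of this kind should be read with caution.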
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 675 | 69 | 55 | 19 | 9.27% | 74.32% |
| Random Forest (SMOTE) | 691 | 53 | 47 | 27 | 7.12% | 63.51% |
| LightGBM | 685 | 59 | 41 | 33 | 7.93% | 55.41% |
| Balanced RF | 671 | 73 | 40 | 34 | 9.81% | 54.05% |
| SGD SVM | 580 | 164 | 55 | 19 | 22.04% | 74.32% |
| IsolationForest | 714 | 30 | 64 | 10 | 4.03% | 86.49% |
Best Models by Metric
| Metric | Best Model | Value |
|---|---|---|
| Accuracy | IsolationForest | 0.8851 |
| Balanced Acc | LightGBM | 0.6833 |
| Precision | LightGBM | 0.3587 |
| Recall | Balanced RF | 0.4595 |
| F1 | LightGBM | 0.3976 |
| ROC-AUC | LightGBM | 0.8621 |
| PR-AUC | LightGBM | 0.3801 |
| Lowest False Positive Rate | IsolationForest | 4.03% |
| Lowest Miss Rate | Balanced RF | 54.05% |
XDR – Metrics by Model
XDR – ROC Curves
XDR – Precision–Recall Curves
XDR – Predicted Probability Distributions
XDR – Threshold Sweep
XDR: Logistic Regression – Detailed Analysis
XDR – Logistic Regression: Confusion Matrix
XDR – Logistic Regression: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9247 | 0.9073 | 0.9159 | 744 |
| 1 | 0.2159 | 0.2568 | 0.2346 | 74 |
| accuracy | n/a | n/a | 0.8484 | 818 |
XDR – Logistic Regression: Feature Importance
XDR: Random Forest (SMOTE) – Detailed Analysis
XDR – Random Forest (SMOTE): Confusion Matrix
XDR – Random Forest (SMOTE): Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9363 | 0.9288 | 0.9325 | 744 |
| 1 | 0.3375 | 0.3649 | 0.3506 | 74 |
| accuracy | n/a | n/a | 0.8778 | 818 |
XDR – Random Forest (SMOTE): Feature Importance
XDR: LightGBM – Detailed Analysis
XDR – LightGBM: Confusion Matrix
XDR – LightGBM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9435 | 0.9207 | 0.9320 | 744 |
| 1 | 0.3587 | 0.4459 | 0.3976 | 74 |
| accuracy | n/a | n/a | 0.8778 | 818 |
XDR – LightGBM: Feature Importance
XDR: Balanced RF – Detailed Analysis
XDR – Balanced RF: Confusion Matrix
XDR – Balanced RF: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9437 | 0.9019 | 0.9223 | 744 |
| 1 | 0.3178 | 0.4595 | 0.3757 | 74 |
| accuracy | n/a | n/a | 0.8619 | 818 |
XDR – Balanced RF: Feature Importance
XDR: SGD SVM – Detailed Analysis
XDR – SGD SVM: Confusion Matrix
XDR – SGD SVM: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9134 | 0.7796 | 0.8412 | 744 |
| 1 | 0.1038 | 0.2568 | 0.1479 | 74 |
| accuracy | n/a | n/a | 0.7323 | 818 |
XDR – SGD SVM: Feature Importance
XDR: IsolationForest – Detailed Analysis
XDR – IsolationForest: Confusion Matrix
XDR – IsolationForest: Classification Report
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9177 | 0.9597 | 0.9382 | 744 |
| 1 | 0.2500 | 0.1351 | 0.1754 | 74 |
| accuracy | n/a | n/a | 0.8851 | 818 |
XDR – IsolationForest: Feature Importance
Feature importance not available for this model type.